From the potential outcomes framework to the intent to treat (ITT) and complier average treatment effects (CACE aka LATE).
Lets define and generate the population we’re going to study.
population_size <- 1000000
base_income <- 10000
base_income_sd <- 1000
ethnicities <- c("Indian", "Asian", "Black", "Hispanic", "White", "Arabic")
genders <- c("Female", "Male")
demographics <- tibble(
gender = sample(0:1, population_size, prob = runif(length(genders)),
replace = TRUE),
ethnicity = sample(1:6, population_size, prob = runif(length(ethnicities)),
replace = TRUE)
)
How many people have we defined to be in our population?
What is their expected base income?
How many ethnicities are in our population?
population <- randomNames(population_size, return.complete.data = TRUE,
gender = demographics$gender,
ethnicity = demographics$ethnicity) %>%
mutate(id = row_number()) %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(income = round(rnorm(population_size, base_income, base_income_sd))) %>%
mutate(income = income + income * (gender * 0.2)) %>%
mutate(income = income + income * (ethnicity * 0.1)) %>%
mutate(gender = map_chr(gender, ~ genders[.x + 1]),
ethnicity = map_chr(ethnicity, ~ ethnicities[.x]))
summary(population)
gender ethnicity first_name last_name
Length:1000000 Length:1000000 Michael : 10737 Johnson : 7085
Class :character Class :character Joshua : 10045 Nguyen : 6910
Mode :character Mode :character Brandon : 9018 Smith : 6826
Christopher: 8080 Martinez: 6803
Tyler : 7506 Williams: 5515
Matthew : 7255 Garcia : 4841
(Other) :947359 (Other) :962020
id income
Min. : 1 Min. : 5808
1st Qu.: 250001 1st Qu.:12194
Median : 500000 Median :13879
Mean : 500000 Mean :14090
3rd Qu.: 750000 3rd Qu.:15726
Max. :1000000 Max. :26394
population %>%
ggplot(aes(ethnicity, fill = ethnicity)) +
geom_bar()
population %>%
ggplot(aes(gender, fill = gender)) +
geom_bar()
population %>%
ggplot(aes(income)) +
geom_density() +
facet_grid(ethnicity ~ gender)
How much more/less income do males make compared to females?
On average, which ethnicity makes the most?
On average, which ethnicity makes the least?
Are there any significant differences in the proportion of genders? If so, which?
Are there any significant differences in the proportion of ethnicities? If so, which?
Now that we have a population, we will generate the effects that the policy will have on each individual. Since we have full knowledge and control of this experiment, we will know and define the treatment effect for each individual, along with both of their potential outcomes (with and without treatment). The researcher will not know these values, it is their job to recover these values.
effect_size <- 3000
effect_sd <- 1000
population <- population %>%
mutate(potential_0 = income * 0.9) %>%
mutate(potential_1 = round(
potential_0 + rnorm(population_size, effect_size, effect_sd)))
population %>%
pivot_longer(c(potential_0, potential_1)) %>%
ggplot(aes(value, fill = name)) +
geom_density(alpha = 0.2)
What the mean and standard deviation of the potential outcomes without treatment?
What the mean and standard deviation of the potential outcomes with treatment?
What is our expected average treatment effect?
As a researcher, we don’t have access to the full population. We typically have a limited set of people we can include in our research study. The first thing we need to do is obtain a study sample from the population. If we have a registry or census with everyone person from the population, we could randomly sample from it.
Now we’re the researcher and we’re given different sample sizes from the population to try to estimate our treatment effect.
sample_sizes <- 10 ^ seq(1, log(population_size, 10))
What are the different sample sizes we will work with as a researcher?
Does the researcher know the values of both potential outcomes for the individuals in their sample?
Describe, in a few sentences, how you would try to estimate the ATE from the samples we are given?
First we simulate the RCT with each of the sample sizes from above with Full Compliance.
What are the four different types of individuals in our sample and which of those individuals are assumed to exist and not exist under full compliance?
The code below simulates random assignment of treatment for each sample size and tries to calculate the observed average treatment effect.
The first row is a summary row taking the average of all of the individual rows below it. The only thing that changes from table to table is the number of individuals in that sample, notice the number of rows in each table in the bottom left hand corner. Scroll to the right to see all the columns.
invisible(lapply(sample_sizes, function(sample_size) {
experiment <- population %>%
sample_n(sample_size) %>%
# select(id, first_name, starts_with("potential_")) %>%
mutate(unit_tx_effect = potential_1 - potential_0) %>%
arrange(runif(n())) %>%
mutate(group = if_else(row_number() <= n() / 2, "treatment", "control")) %>%
arrange(id) %>%
mutate(observed_0 = if_else(group == "control", potential_0,
NA_real_),
observed_1 = if_else(group == "treatment", potential_1,
NA_real_)) %>%
bind_rows(summarise(., across(where(is.numeric), mean, na.rm = TRUE)) %>%
ungroup() %>%
mutate(across(where(is.numeric), round, digits = 1),
across(where(is.factor), ~ "<<SUMMARY>>"),
id = NA)) %>%
mutate(observed_ate = observed_1 - observed_0) %>%
arrange(desc(row_number())) %>%
select(-id) %>% print()
}))